pacman::p_load(sf, tidyverse, tmap, httr, rvest, sfdep, readxl, jsonlite, olsrr, corrplot, ggpubr, GWmodel, kableExtra, plotly, ggthemes, broom, devtools)Take-Home_Ex03
1 Overview
The price of housing is affected by many factors such as the general economy of a country or inflation rate. These factors can be further divided to structural and locational factors. Structural factors are variables related to the property themselves such as the size, fitting, and tenure of the property. Locational factors are variables related to the neighbourhood of the properties such as proximity to childcare centre, public transport service and shopping centre.
Conventional, housing resale prices predictive models were built by using Ordinary Least Square (OLS) method. However, this method failed to take into consideration that spatial autocorrelation and spatial heterogeneity exist in geographic data sets such as housing transactions. With the existence of spatial autocorrelation, the OLS estimation of predictive housing resale pricing models could lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, Geographical Weighted Models were introduced for calibrating predictive model for housing resale prices.
1.1 Data Used
| Type | Name | Format | Source |
|---|---|---|---|
| Aspatial | HDB Flat Resale Prices | csv | data.gov |
| Geospatial | Master Plan 2014 Subzone Boundary | shp | data.gov |
| Locational (w geo-coord) | Eldercare services | shp | data.gov |
| Locational (w geo-coord) | Park facilities | geojson | data.gov |
| Locational (w geo-coord) | Hawker centres | geojson | data.gov |
| Locational (w geo-coord) | Supermarkets | geojson | data.gov |
| Locational (w geo-coord) | Bus Stops | shp | Datamall.lta.gov.sg |
| Locational (w geo-coord) | MRT | shp | Datamall.lta.gov.sg |
| Locational (w geo-coord) | Childcare | csv | dataportal.asia |
| Locational (no geo-coord) | Kindergartens | csv | dataportal.asia |
| Locational (no geo-coord) | Primary school | csv | data.gov |
| Locational (no geo-coord) | CBD | Google search | |
| Locational (no geo-coord) | Shopping Malls | list | Wikipedia |
| Locational (no geo-coord) | Good primary school | list | Local Salary Forum |
1.2 Packages
[References taken from Take-home Exercise 3 by NOR AISYAH BINTE AJIT.]
[References taken from Take-Home Exercise 3: Hedonic Pricing Models for Resale Prices of Public Housing in Singapore by MEGAN SIM TZE YEN.]
sf: used for importing, managing, and processing geospatial data
tidyverse: a collection of packages for data science tasks
tmap: used for creating thematic maps, such as choropleth and bubble maps
sfdep: used to create spatial weights matrix objects, global and local spatial autocorrelation statistics and related calculations (e.g. spatially lag attributes)
httr: used to make API calls, such as a GET request
jsonlite: a JSON parser that can convert from JSON to the appropraite R data types
rvest: A new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup.
- In this analysis, it will be used to scrape data for Shopping malls and Good primary schools
Tidyverse packages:
readr for importing delimited files (.csv)
readxl for importing Excel worksheets (.xlsx) - note that it has to be loaded explicitly as it is not a core tidyverse package
tidyr for manipulating and tidying data
dplyr for wrangling and transforming data
ggplot2 for visualising data
Building + visualising hedonic pricing models:
olsrr: used for building least squares regression models
coorplot + ggpubr: both are used for multivariate data visualisation & analysis
GWmodel: provides a collection of localised spatial statistical methods, such as summary statistics, principal components analysis, discriminant analysis and various forms of GW regression
Visualisations:
devtools: used for installing any R packages which is not available in RCRAN. In this exercise, I will be installing using devtools to install the package xaringanExtra which is still under development stage.
kableExtra: an extension of kable, used for table customisation
plotly: used for creating interactive web graphics, and can be used in conjunction with ggplot2 with the
ggplotly()functionggthemes: an extension of ggplot2, with more advanced themes for plotting
1.3 Import Packages
devtools::install_github("gadenbuie/xaringanExtra")
library(xaringanExtra)2 Aspatial Data Wrangling
[References taken from Take-home Exercise 3 by NOR AISYAH BINTE AJIT.]
resale <- read_csv("data/aspatial/resale-flat-prices.csv")
glimpse(resale)Rows: 148,277
Columns: 11
$ month <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block <chr> "406", "108", "602", "465", "601", "150", "447", "…
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
$ storey_range <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
$ floor_area_sqm <dbl> 44, 67, 67, 68, 67, 68, 68, 67, 68, 67, 68, 67, 67…
$ flat_model <chr> "Improved", "New Generation", "New Generation", "N…
$ lease_commence_date <dbl> 1979, 1978, 1980, 1980, 1980, 1981, 1979, 1976, 19…
$ remaining_lease <chr> "61 years 04 months", "60 years 07 months", "62 ye…
$ resale_price <dbl> 232000, 250000, 262000, 265000, 265000, 275000, 28…
2.1 Filter the data
The study should focus on either three-room, four-room or five-room flat and transaction period should be from 1st January 2021 to 31st December 2022. The test data should be January and February 2023 resale prices.
For this project, I will be going with 4 room flats.
rs_subset <- filter(resale,flat_type == "4 ROOM") %>%
filter(month >= "2021-01" & month <= "2022-12")glimpse(rs_subset)Rows: 23,656
Columns: 11
$ month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", …
$ block <chr> "547", "414", "509", "467", "571", "134", "204", "…
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 10", "ANG MO …
$ storey_range <chr> "04 TO 06", "01 TO 03", "01 TO 03", "07 TO 09", "0…
$ floor_area_sqm <dbl> 92, 92, 91, 92, 92, 98, 92, 92, 92, 92, 92, 109, 9…
$ flat_model <chr> "New Generation", "New Generation", "New Generatio…
$ lease_commence_date <dbl> 1981, 1979, 1980, 1979, 1979, 1978, 1977, 1978, 19…
$ remaining_lease <chr> "59 years", "57 years 09 months", "58 years 06 mon…
$ resale_price <dbl> 370000, 375000, 380000, 385000, 410000, 410000, 41…
unique(rs_subset$month) [1] "2021-01" "2021-02" "2021-03" "2021-04" "2021-05" "2021-06" "2021-07"
[8] "2021-08" "2021-09" "2021-10" "2021-11" "2021-12" "2022-01" "2022-02"
[15] "2022-03" "2022-04" "2022-05" "2022-06" "2022-07" "2022-08" "2022-09"
[22] "2022-10" "2022-11" "2022-12"
unique(rs_subset$flat_type)[1] "4 ROOM"
2.2 Transform resale data
After checking the correctly filtered out data, next is transforming the data,
2.2.1 New columns
rs_transform <- rs_subset %>%
mutate(rs_subset, address = paste(block,street_name)) %>%
mutate(rs_subset, remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2))) %>%
mutate(rs_subset, remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))head(rs_transform)# A tibble: 6 × 14
month town flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 2021-01 ANG MO … 4 ROOM 547 ANG MO… 04 TO … 92 New Ge… 1981 59 yea…
2 2021-01 ANG MO … 4 ROOM 414 ANG MO… 01 TO … 92 New Ge… 1979 57 yea…
3 2021-01 ANG MO … 4 ROOM 509 ANG MO… 01 TO … 91 New Ge… 1980 58 yea…
4 2021-01 ANG MO … 4 ROOM 467 ANG MO… 07 TO … 92 New Ge… 1979 57 yea…
5 2021-01 ANG MO … 4 ROOM 571 ANG MO… 07 TO … 92 New Ge… 1979 57 yea…
6 2021-01 ANG MO … 4 ROOM 134 ANG MO… 07 TO … 98 New Ge… 1978 56 yea…
# … with 4 more variables: resale_price <dbl>, address <chr>,
# remaining_lease_yr <int>, remaining_lease_mth <int>, and abbreviated
# variable names ¹flat_type, ²street_name, ³storey_range, ⁴floor_area_sqm,
# ⁵flat_model, ⁶lease_commence_date, ⁷remaining_lease
2.2.2 Add up the remaining lease in months
- Replace NA values in remaining_lease_mth with the value 0 using is.na() function
- Multiply remaining_lease_yr by 12 to convert it to months unit
- Create remaining_lease_mths column using mutate function of dplyr package which contains the summation of the remaining_lease_yr and remaining_lease_mths using rowSums() function of base R package
- Select required columns for analysis
rs_transform$remaining_lease_mth[is.na(rs_transform$remaining_lease_mth)] <- 0
rs_transform$remaining_lease_yr <- rs_transform$remaining_lease_yr * 12
rs_transform <- rs_transform %>%
mutate(rs_transform, remaining_lease_mths = rowSums(rs_transform[, c("remaining_lease_yr", "remaining_lease_mth")])) %>%
select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model,
lease_commence_date, remaining_lease_mths, resale_price)head(rs_transform)# A tibble: 6 × 12
month town address block stree…¹ flat_…² store…³ floor…⁴ flat_…⁵ lease…⁶
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 2021-01 ANG MO … 547 AN… 547 ANG MO… 4 ROOM 04 TO … 92 New Ge… 1981
2 2021-01 ANG MO … 414 AN… 414 ANG MO… 4 ROOM 01 TO … 92 New Ge… 1979
3 2021-01 ANG MO … 509 AN… 509 ANG MO… 4 ROOM 01 TO … 91 New Ge… 1980
4 2021-01 ANG MO … 467 AN… 467 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
5 2021-01 ANG MO … 571 AN… 571 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
6 2021-01 ANG MO … 134 AN… 134 ANG MO… 4 ROOM 07 TO … 98 New Ge… 1978
# … with 2 more variables: remaining_lease_mths <dbl>, resale_price <dbl>, and
# abbreviated variable names ¹street_name, ²flat_type, ³storey_range,
# ⁴floor_area_sqm, ⁵flat_model, ⁶lease_commence_date
2.3 Retrieve Postal Codes and Coordinates of Addresses
This section aims to get the postal codes and coordinates which is needed for the locational factors without geographical coordinates.
2.3.1 Create a list storing unique addresses
- Use unique() function to extract the unique addresses then use sort() function to sort the unique vector.
add_list <- sort(unique(rs_transform$address))2.4 Create function to retrieve coordinates from OneMap.Sg API
get_coords <- function(add_list){
# Create a data frame to store all retrieved coordinates
postal_coords <- data.frame()
for (i in add_list){
#print(i)
r <- GET('https://developers.onemap.sg/commonapi/search?',
query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
data <- fromJSON(rawToChar(r$content))
found <- data$found
res <- data$results
# Create a new data frame for each address
new_row <- data.frame()
# If single result, append
if (found == 1){
postal <- res$POSTAL
lat <- res$LATITUDE
lng <- res$LONGITUDE
new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
res_sub <- res[res$POSTAL != "NIL", ]
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
}
else{
top1 <- head(res_sub, n = 1)
postal <- top1$POSTAL
lat <- top1$LATITUDE
lng <- top1$LONGITUDE
new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
}
}
else {
new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
}
# Add the row
postal_coords <- rbind(postal_coords, new_row)
}
return(postal_coords)
}2.4.1 Call get_coords function to retrieve resale coordinates
coords <- get_coords(add_list)2.4.2 Inspect results
coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ] address postal latitude longitude
1305 215 CHOA CHU KANG CTRL NIL 1.38308302434129 103.747076627693
It seems like there is 1 address that does not contain geographic coordinates. After further investigation, I will be input in
coords$postal[coords$postal=="NIL"] <- "680215"coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ][1] address postal latitude longitude
<0 rows> (or 0-length row.names)
great, no more NIL postal codes.
2.4.3 Combine resale and coordinates data
Now, we need to combine the data
rs_coords <- left_join(rs_transform, coords, by = c('address' = 'address'))head(rs_coords)# A tibble: 6 × 15
month town address block stree…¹ flat_…² store…³ floor…⁴ flat_…⁵ lease…⁶
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 2021-01 ANG MO … 547 AN… 547 ANG MO… 4 ROOM 04 TO … 92 New Ge… 1981
2 2021-01 ANG MO … 414 AN… 414 ANG MO… 4 ROOM 01 TO … 92 New Ge… 1979
3 2021-01 ANG MO … 509 AN… 509 ANG MO… 4 ROOM 01 TO … 91 New Ge… 1980
4 2021-01 ANG MO … 467 AN… 467 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
5 2021-01 ANG MO … 571 AN… 571 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
6 2021-01 ANG MO … 134 AN… 134 ANG MO… 4 ROOM 07 TO … 98 New Ge… 1978
# … with 5 more variables: remaining_lease_mths <dbl>, resale_price <dbl>,
# postal <chr>, latitude <chr>, longitude <chr>, and abbreviated variable
# names ¹street_name, ²flat_type, ³storey_range, ⁴floor_area_sqm,
# ⁵flat_model, ⁶lease_commence_date
2.4.4 Handle invalid addresses
By replacing sub string in invalid addresses in the address and extract rows with addresses containing SAINT GEORGE’S
rs_coords$address <- sub("ST. GEORGE'S", "SAINT GEORGE'S", rs_coords$address)
rs_invalid <- rs_coords[grepl("SAINT GEORGE'S", rs_coords$address), ]glimpse(rs_invalid)Rows: 32
Columns: 15
$ month <chr> "2021-01", "2021-03", "2021-03", "2021-05", "2021…
$ town <chr> "KALLANG/WHAMPOA", "KALLANG/WHAMPOA", "KALLANG/WH…
$ address <chr> "4B SAINT GEORGE'S LANE", "18 SAINT GEORGE'S RD",…
$ block <chr> "4B", "18", "18", "4B", "19", "5", "9", "4B", "5"…
$ street_name <chr> "ST. GEORGE'S LANE", "ST. GEORGE'S RD", "ST. GEOR…
$ flat_type <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM",…
$ storey_range <chr> "10 TO 12", "04 TO 06", "10 TO 12", "04 TO 06", "…
$ floor_area_sqm <dbl> 105, 91, 91, 106, 91, 93, 84, 105, 93, 91, 91, 92…
$ flat_model <chr> "Model A", "New Generation", "New Generation", "M…
$ lease_commence_date <dbl> 1996, 1984, 1984, 1996, 1984, 1981, 1985, 1996, 1…
$ remaining_lease_mths <dbl> 899, 754, 754, 895, 749, 707, 753, 893, 702, 746,…
$ resale_price <dbl> 585000, 465000, 475000, 590000, 440000, 580000, 4…
$ postal <chr> "321004", "320018", "320018", "321004", "320019",…
$ latitude <chr> "1.32229102470373", "1.32406159729828", "1.324061…
$ longitude <chr> "103.860174104007", "103.862833773877", "103.8628…
There are 32 rows that contains SAINT GEORGE’S as street name but has a substring replaced in the address.
2.4.4.1 Create unique list of addresses again
add_list <- sort(unique(rs_invalid$address))2.4.4.2 Call get_coords to retrieve resale coordinates again
rs_invalid_coords <- get_coords(add_list)2.4.4.3 Inspect results again
rs_invalid_coords[(is.na(rs_invalid_coords$postal) | is.na(rs_invalid_coords$latitude) | is.na(rs_invalid_coords$longitude)), ][1] address postal latitude longitude
<0 rows> (or 0-length row.names)
Yep, the results shows no invalid coordinates now.
2.4.4.4 Combine rs_invalid_coords with rs_coords data
rs_coords_final <- rs_coords %>%
left_join(rs_invalid_coords, by = c("address")) %>%
mutate(latitude = ifelse(is.na(postal.x), postal.y, postal.x)) %>%
mutate(latitude = ifelse(is.na(latitude.x), latitude.y, latitude.x)) %>%
mutate(longitude = ifelse(is.na(longitude.x), longitude.y, longitude.x)) %>%
select(-c(postal.x, latitude.x, longitude.x, postal.y, latitude.y, longitude.y))head(rs_coords_final)# A tibble: 6 × 14
month town address block stree…¹ flat_…² store…³ floor…⁴ flat_…⁵ lease…⁶
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 2021-01 ANG MO … 547 AN… 547 ANG MO… 4 ROOM 04 TO … 92 New Ge… 1981
2 2021-01 ANG MO … 414 AN… 414 ANG MO… 4 ROOM 01 TO … 92 New Ge… 1979
3 2021-01 ANG MO … 509 AN… 509 ANG MO… 4 ROOM 01 TO … 91 New Ge… 1980
4 2021-01 ANG MO … 467 AN… 467 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
5 2021-01 ANG MO … 571 AN… 571 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
6 2021-01 ANG MO … 134 AN… 134 ANG MO… 4 ROOM 07 TO … 98 New Ge… 1978
# … with 4 more variables: remaining_lease_mths <dbl>, resale_price <dbl>,
# latitude <chr>, longitude <chr>, and abbreviated variable names
# ¹street_name, ²flat_type, ³storey_range, ⁴floor_area_sqm, ⁵flat_model,
# ⁶lease_commence_date
2.5 Write file to rds
rs_coords_rds <- write_rds(rs_coords_final, "data/aspatial/rds/rs_coords.rds")2.6 Read rs_coords RDS file
rs_coords <- read_rds("data/aspatial/rds/rs_coords.rds")glimpse(rs_coords)Rows: 23,656
Columns: 14
$ month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO…
$ address <chr> "547 ANG MO KIO AVE 10", "414 ANG MO KIO AVE 10",…
$ block <chr> "547", "414", "509", "467", "571", "134", "204", …
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 10", "ANG MO…
$ flat_type <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM",…
$ storey_range <chr> "04 TO 06", "01 TO 03", "01 TO 03", "07 TO 09", "…
$ floor_area_sqm <dbl> 92, 92, 91, 92, 92, 98, 92, 92, 92, 92, 92, 109, …
$ flat_model <chr> "New Generation", "New Generation", "New Generati…
$ lease_commence_date <dbl> 1981, 1979, 1980, 1979, 1979, 1978, 1977, 1978, 1…
$ remaining_lease_mths <dbl> 708, 693, 702, 695, 689, 681, 661, 682, 692, 692,…
$ resale_price <dbl> 370000, 375000, 380000, 385000, 410000, 410000, 4…
$ latitude <chr> "1.37420951743562", "1.36390466431674", "1.374000…
$ longitude <chr> "103.858209667888", "103.853913839503", "103.8501…
2.6.1 Assign and Transform CRS and Check
rs_coords_sf <- st_as_sf(rs_coords,
coords = c("longitude",
"latitude"),
crs=4326) %>%
st_transform(crs = 3414)st_crs(rs_coords_sf)Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
2.6.2 Check for invalid geometries
length(which(st_is_valid(rs_coords_sf) == FALSE))[1] 0
No invalid geometries
2.6.3 Plot hdb resale points
tmap_mode("view")
tm_shape(rs_coords_sf)+
tm_dots(col="blue", size = 0.02)